Use weak references to refer to the original data in the accessor #880

Closed
freddyaboulton opened this issue May 3, 2021 · 0 comments · Fixed by #894
Labels: evalml (EvalML request), needs design (Issues requiring design documentation.), new feature (suggestions for new functionality)

freddyaboulton commented May 3, 2021

As a user of woodwork, I noticed that the table and column accessors hold a strong reference to the original dataframe/series. Because the accessor always points to the original data, the data's reference count stays at 1 or higher even after the user deletes their own reference, so the memory taken up by the original data is not released at that point. We should use a weak reference instead so that the memory can be freed. To see how this would work, see #881
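To make the difference concrete, here is a minimal sketch that does not use woodwork or pandas at all. `Data`, `StrongAccessor`, `WeakAccessor` and `freed_without_gc` are hypothetical names invented for this illustration; `Data` stands in for a dataframe that caches its accessor, and the sketch assumes CPython's reference counting. It is not the change made in #881, only an illustration of the idea.

```python
import gc
import weakref


class Data:
    """Hypothetical stand-in for a dataframe that caches its accessor."""

    def __init__(self, accessor_cls):
        self.accessor = accessor_cls(self)


class StrongAccessor:
    def __init__(self, data):
        # Strong reference: Data -> accessor -> Data forms a reference cycle.
        self._data = data


class WeakAccessor:
    def __init__(self, data):
        # Weak reference: does not keep the original data alive.
        self._data_ref = weakref.ref(data)

    @property
    def data(self):
        data = self._data_ref()
        if data is None:
            raise ReferenceError("the original data no longer exists")
        return data


def freed_without_gc(accessor_cls):
    """Return True if dropping the last user reference frees the object at once."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cyclic collector so only refcounting is at work
    try:
        obj = Data(accessor_cls)
        probe = weakref.ref(obj)
        del obj
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind by the strong variant


print("strong accessor freed immediately:", freed_without_gc(StrongAccessor))
print("weak accessor freed immediately:", freed_without_gc(WeakAccessor))
```

On CPython the strong variant should report `False` and the weak variant `True`. The trade-off of the weak variant is that the accessor has to handle the case where the data is already gone, which is why the `data` property raises `ReferenceError` in this sketch.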

To convince myself this was happening, I used the following script, which I took from this blog post.

import gc
import pandas as pd
import woodwork as ww


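def dump_garbage():
    # Force a full collection. With gc.DEBUG_SAVEALL set (see __main__),
    # every unreachable object the collector finds is appended to gc.garbage
    # instead of being freed.
    gc.collect()

    print("\nUNCOLLECTABLE OBJECTS:")
    for x in gc.garbage:
        s = str(x)
        # Truncate long representations to keep the output readable.
        if len(s) > 80:
            s = s[:77] + '...'
        print(type(x), "\n  ", s)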


def non_leaky_list():
    l = []
    l.append(1)
    del l
    dump_garbage()


def leaky_list():
    l = []
    l.append(l)
    del l
    dump_garbage()


def make_dataframe():
    df = pd.DataFrame({"a": [1, 2, 3],
                       "b": [4, 5, 6]})
    del df
    dump_garbage()


def make_ww_dataframe():
    df = pd.DataFrame({"a": [1, 2, 3],
                       "b": [4, 5, 6]})
    df.ww.init()
    del df
    dump_garbage()


if __name__ == "__main__":
    gc.enable()
    gc.set_debug(gc.DEBUG_SAVEALL)
    print("Non Leaky List")
    non_leaky_list()
    print("Leaky list")
    leaky_list()
    print("Make a dataframe")
    make_dataframe()
    print("Make a WW dataframe")
    make_ww_dataframe()

The leaky_list and non_leaky_list functions were added to sanity check that only leaky objects appear in gc.garbage.

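The output of the script should look like this. The pandas dataframe initialized with woodwork now shows up as "uncollectable" (the [[...]] list from leaky_list reappears in the later sections because gc.garbage is never cleared between calls):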

Non Leaky List

UNCOLLECTABLE OBJECTS:
Leaky list

UNCOLLECTABLE OBJECTS:
<class 'list'>
   [[...]]
Make a dataframe

UNCOLLECTABLE OBJECTS:
<class 'list'>
   [[...]]
Make a WW dataframe

UNCOLLECTABLE OBJECTS:
<class 'list'>
   [[...]]
<class 'pandas.core.frame.DataFrame'>
      a  b
0  1  4
1  2  5
2  3  6
<class 'pandas.core.indexes.base.Index'>
   Index(['a', 'b'], dtype='object')
<class 'dict'>
   {'_data': array(['a', 'b'], dtype=object), '_index_data': array(['a', 'b'], d...
<class 'pandas.core.indexes.range.RangeIndex'>
   RangeIndex(start=0, stop=3, step=1)
<class 'dict'>
   {'_range': range(0, 3), '_name': None, '_cache': {}, '_id': <object object at...
<class 'pandas.core.internals.blocks.IntBlock'>
   IntBlock: slice(0, 2, 1), 2 x 3, dtype: int64
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 2, 1))
<class 'slice'>
   slice(0, 2, 1)
<class 'pandas.core.internals.managers.BlockManager'>
   BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(star...
<class 'list'>
   [Index(['a', 'b'], dtype='object'), RangeIndex(start=0, stop=3, step=1)]
<class 'tuple'>
   (IntBlock: slice(0, 2, 1), 2 x 3, dtype: int64,)
<class 'dict'>
   {'_is_copy': None, '_mgr': BlockManager
Items: Index(['a', 'b'], dtype='objec...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb968458db0; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb968458db0; dead>}
<class 'woodwork.table_accessor.PandasTableAccessor'>
          Physical Type Logical Type Semantic Tag(s)
Column                     ...
<class 'dict'>
   {'_dataframe':    a  b
0  1  4
1  2  5
2  3  6, '_schema':        Logical Typ...
<class 'cell'>
   <cell at 0x7fb968490e20: numpy.ndarray object at 0x7fb968466930>
<class 'tuple'>
   (<cell at 0x7fb968490e20: numpy.ndarray object at 0x7fb968466930>,)
<class 'function'>
   <function Index._engine.<locals>.<lambda> at 0x7fb9703ae9d0>
<class 'pandas._libs.index.ObjectEngine'>
   <pandas._libs.index.ObjectEngine object at 0x7fb8f00ac090>
<class 'dict'>
   {'_engine': <pandas._libs.index.ObjectEngine object at 0x7fb8f00ac090>, 'is_u...
<class 'pandas.core.internals.blocks.IntBlock'>
   IntBlock: 3 dtype: int64
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 3, 1))
<class 'slice'>
   slice(0, 3, 1)
<class 'pandas.core.internals.managers.SingleBlockManager'>
   SingleBlockManager
Items: RangeIndex(start=0, stop=3, step=1)
IntBlock: 3 dty...
<class 'list'>
   [RangeIndex(start=0, stop=3, step=1)]
<class 'tuple'>
   (IntBlock: 3 dtype: int64,)
<class 'pandas.core.series.Series'>
   0    1
1    2
2    3
Name: a, dtype: int64
<class 'dict'>
   {'_is_copy': None, '_mgr': SingleBlockManager
Items: RangeIndex(start=0, stop...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb9703aa680; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb9703aa680; dead>}
<class 'dict'>
   {'a': 0    1
1    2
2    3
Name: a, dtype: int64, 'b': 0    4
1    5
2    6
N...
<class 'tuple'>
   ('a', <weakref at 0x7fb968458db0; dead>)
<class 'cell'>
   <cell at 0x7fb9804f9df0: function object at 0x7fb9703ae940>
<class 'cell'>
   <cell at 0x7fb9804f9dc0: TypeSystem object at 0x7fb99030d6d0>
<class 'list'>
   [IntegerNullable, Integer]
<class 'tuple'>
   ([IntegerNullable, Integer],)
<class 'tuple'>
   (<cell at 0x7fb9804f9df0: function object at 0x7fb9703ae940>, <cell at 0x7fb9...
<class 'function'>
   <function TypeSystem.infer_logical_type.<locals>.get_inference_matches at 0x7...
<class 'pandas.core.internals.blocks.IntBlock'>
   IntBlock: 3 dtype: int64
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 3, 1))
<class 'slice'>
   slice(0, 3, 1)
<class 'pandas.core.internals.managers.SingleBlockManager'>
   SingleBlockManager
Items: RangeIndex(start=0, stop=3, step=1)
IntBlock: 3 dty...
<class 'list'>
   [RangeIndex(start=0, stop=3, step=1)]
<class 'tuple'>
   (IntBlock: 3 dtype: int64,)
<class 'pandas.core.series.Series'>
   0    4
1    5
2    6
Name: b, dtype: int64
<class 'dict'>
   {'_is_copy': None, '_mgr': SingleBlockManager
Items: RangeIndex(start=0, stop...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb98054a810; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb98054a810; dead>}
<class 'tuple'>
   ('b', <weakref at 0x7fb968458db0; dead>)
<class 'cell'>
   <cell at 0x7fb9804f97f0: function object at 0x7fb9703ae790>
<class 'cell'>
   <cell at 0x7fb9804f9820: TypeSystem object at 0x7fb99030d6d0>
<class 'list'>
   [IntegerNullable, Integer]
<class 'tuple'>
   ([IntegerNullable, Integer],)
<class 'tuple'>
   (<cell at 0x7fb9804f97f0: function object at 0x7fb9703ae790>, <cell at 0x7fb9...
<class 'function'>
   <function TypeSystem.infer_logical_type.<locals>.get_inference_matches at 0x7...
<class 'woodwork.table_schema.TableSchema'>
          Logical Type Semantic Tag(s)
Column
a    ...
<class 'woodwork.column_schema.ColumnSchema'>
   <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
<class 'dict'>
   {'metadata': {}, 'description': None, 'logical_type': Integer, 'use_standard_...
<class 'set'>
   {'numeric'}
<class 'dict'>
   {'a': <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>, ...
<class 'woodwork.column_schema.ColumnSchema'>
   <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
<class 'dict'>
   {'metadata': {}, 'description': None, 'logical_type': Integer, 'use_standard_...
<class 'set'>
   {'numeric'}
<class 'dict'>
   {'name': None, 'columns': {'a': <ColumnSchema (Logical Type = Integer) (Seman...
<class 'pandas.core.series.Series'>
   0       a  b
1    0  1  4
2    1  2  5
3    2  3  6
dtype: object
<class 'pandas.core.indexes.range.RangeIndex'>
   RangeIndex(start=0, stop=4, step=1)
<class 'dict'>
   {'_range': range(0, 4), '_name': None, '_cache': {}, '_id': <object object at...
<class 'pandas.core.internals.blocks.ObjectBlock'>
   ObjectBlock: 4 dtype: object
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 4, 1))
<class 'slice'>
   slice(0, 4, 1)
<class 'pandas.core.internals.managers.SingleBlockManager'>
   SingleBlockManager
Items: RangeIndex(start=0, stop=4, step=1)
ObjectBlock: 4 ...
<class 'list'>
   [RangeIndex(start=0, stop=4, step=1)]
<class 'tuple'>
   (ObjectBlock: 4 dtype: object,)
<class 'dict'>
   {'_is_copy': None, '_mgr': SingleBlockManager
Items: RangeIndex(start=0, stop...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb9607e4270; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb9607e4270; dead>}
<class 'pandas.core.strings.accessor.StringMethods'>
   <pandas.core.strings.accessor.StringMethods object at 0x7fb9601e38e0>
<class 'dict'>
   {'_inferred_dtype': 'string', '_is_categorical': False, '_is_string': False, ...
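Because `gc.garbage` accumulates across calls, the dump above mixes objects from several functions. A minimal sketch of how each case could be checked in isolation is shown here; `garbage_from`, `plain_list` and `self_referencing_list` are hypothetical helper names, not part of the original script. Note that with `gc.DEBUG_SAVEALL` set, the collector appends every unreachable object it finds to `gc.garbage` rather than freeing it, so an entry there means "found by the cyclic collector", not necessarily "can never be freed".

```python
import gc


def garbage_from(make_and_drop):
    """Return the objects the cyclic collector finds after calling make_and_drop."""
    gc.collect()                      # clear out cycles left by earlier code
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)    # save unreachable objects instead of freeing them
    gc.garbage.clear()
    try:
        make_and_drop()
        gc.collect()
        return list(gc.garbage)
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()


def plain_list():
    items = [1]
    del items              # refcount hits zero; freed without the collector


def self_referencing_list():
    items = []
    items.append(items)    # the list refers to itself: a reference cycle
    del items


print("plain list:", len(garbage_from(plain_list)))
print("self-referencing list:", len(garbage_from(self_referencing_list)))
```

On CPython this should report no saved objects for the plain list and one for the self-referencing list. Assuming woodwork is installed, the same helper could be passed `make_dataframe` and `make_ww_dataframe` bodies (without their `dump_garbage()` calls) to compare the two cases without the leftover `[[...]]` entry.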